fix(tools/talis): fibre-experiment race fixes & loadgen tooling by walldiss · Pull Request #3305 · evstack/ev-node

walldiss · 2026-05-02T23:45:15Z

Issues

Operating the talis-driven Fibre throughput experiment surfaced three classes of bug that consistently broke fresh cluster bring-up and made one set of test runs unrepeatable:

Bootstrap races against an unready chain. set-host and deposit-to-escrow were one-shot calls — --yes returns the txhash before block inclusion, so the operator saw "success" while the tx silently bounced (celestia-app is not ready; please wait for first block). All subsequent steps then ran against an unregistered host with no escrow.
Non-atomic keyring scp on bootstrap-evnode. scp -r creates keyring-test/ on the target before any of its files transfer; the evnode init script's [ -d keyring-test ] poll then triggered mid-transfer and evnode launched with a partial keyring (fibre-0.info: key not found).
talis up lied about partial failures. When some EC2 instances failed to launch, CreateAWSInstances returned nil error and talis up printed "deployment complete" against a partial cluster, breaking later steps with confusing errors.

A few smaller bugs also fixed here:

fibre_setup.go queried celestia-appd query fibre escrow (which doesn't exist — it's escrow-account), so the verification loop wedged for 3 minutes regardless of actual state.
sshExec used cmd.CombinedOutput, mixing SSH Warning: Permanently added chatter into bytes the CLI handed to Sscanf("%d") — the providers cross-check parsed those warnings as 0 and looped to its 5-minute deadline even when SSH-direct returned 3.
evnode-txsim used the stdlib default MaxIdleConnsPerHost=2, so any --concurrency > 2 opened fresh TCP/TLS sockets per request and turned the loadgen itself into the binding constraint.

Solution

fibre_setup.go: chain-ready preamble, retry-until-confirmed loops for set-host and fibre-0 deposit-to-escrow, switched verification query to escrow-account and key on "found":true, CLI cross-check that all validators registered.
fibre_bootstrap_evnode.go: stage keyring under /root/.keyring-fibre.staging/ then mv into place atomically.
New fibre_experiment.go umbrella command driving up → genesis → deploy → setup-fibre → start-fibre → fibre-bootstrap-evnode so the operator can't skip a verification step.
aws.go: CreateAWSInstances now returns a joined error on any partial failure.
download.go: sshExec returns stdout only via cmd.Output() and adds -q -o LogLevel=ERROR to silence SSH warning chatter.
evnode-txsim: bump per-host idle conns to 2 × concurrency; always-on pprof on 127.0.0.1:6060 for in-process profiling under live load.

Test plan

talis fibre-experiment runs unattended end-to-end against a fresh cluster
talis CLI returns non-zero when talis up partial-fails
evnode-txsim saturates with --concurrency 32 (verified via pprof — request hot path is now write(2)/body copy, not TCP handshake)

Three race conditions surfaced repeatedly on a fresh AWS bring-up of the Fibre throughput experiment. Each one had the same shape: a talis subcommand "succeeded" at the CLI level (or returned the txhash with --yes) before the chain had actually applied the work, leaving downstream steps to fail in confusing ways. This commit makes each step verify *outcome*, not just *invocation*, so the experiment can go from a fresh `talis up` to a running loadgen without manual intervention. • setup-fibre script (fibre_setup.go) now: - polls `celestia-appd status` for `latest_block_height>0` before submitting any tx — fixes the silent-noop where set-host + 100× deposit-to-escrow all bounced with "celestia-app is not ready; please wait for first block"; - retries `set-host` in a loop until the validator's host shows up in `query valaddr providers` — fixes the case where --yes returns the txhash before block inclusion and the tx silently lands in the mempool but never confirms; - verifies fibre-0's escrow account is funded on-chain before the tmux session exits — same silent-failure mode as set-host, but on the deposit side. The talis-CLI step also now cross-checks all validators are registered from a single vantage point before returning, so a concurrent set-host race surfaces as an error instead of a half-empty provider list start-fibre would cache forever. • fibre-bootstrap-evnode (fibre_bootstrap_evnode.go) now stages the keyring scp into a tmp directory and `mv`s it atomically into place. The previous direct `scp -r` to /root/keyring-fibre/keyring-test created the directory before transferring its contents — the evnode init script's `[ -d keyring-test ]` poll passed mid-transfer, the daemon launched with no fibre-0.info, and crashed with `keyring entry "fibre-0" not found`. • evnode_init.sh (genesis.go) now waits for the specific keyring-test/fibre-0.info file rather than just the keyring-test directory. Belt-and-braces: the bootstrap mv is already atomic on the same filesystem, but the file-level guard means a hand-pushed keyring (not via talis) can't trip the same race. • New `talis fibre-experiment` umbrella command runs up → genesis → deploy → setup-fibre → start-fibre → fibre-bootstrap-evnode in order. Each step uses the same binary as a subprocess; failures in any step abort the chain. Operator goes from a prepared root dir to a running loadgen with one command, instead of remembering the sequence. Verified by 5-min sustained loadgen against julien/fiber HEAD with PR evstack#3287 (concurrent submitter) merged: 47.65 MB/s @ 99.999 % ok, up from the prior 24.57 MB/s baseline (the gap is PR evstack#3287's overlapping uploads — these talis fixes just stop the deploy from silently breaking before throughput matters).

Three follow-up bugs surfaced from the PR evstack#3303 follow-up verification run on a 3-validator AWS Fibre cluster: - aws.go: CreateAWSInstances exited 0 even when individual instance launches failed, so `talis up` lied about success and downstream steps proceeded against a partial cluster. Returns a joined error now so failure cascades stop early. - download.go: sshExec used cmd.CombinedOutput, mixing SSH warnings (the "Warning: Permanently added '...'..." chatter on stderr) into bytes the caller hands to fmt.Sscanf("%d"). The CLI-side providers cross-check parsed those warnings as 0 and looped until its 5-min deadline even though a direct SSH query showed all 3 providers registered. Switch to cmd.Output() (stdout only) and add `-q -o LogLevel=ERROR` to silence the chatter for any caller that does combine streams. - fibre_setup.go: the per-validator escrow verification used `celestia-appd query fibre escrow` which doesn't exist — the actual subcommand is `escrow-account`. The query errored on every retry, the grep for "amount" never matched, and the script wedged on the 3-min deadline reporting `FATAL: fibre-0 escrow not present`. Switch to `escrow-account` and key on `"found":true` (the explicit existence flag in the response). Also wrap the fibre-0 deposit-to-escrow itself in a retry loop matching set-host — same `--yes`-returns-before-inclusion silent-failure mode bit it. fibre-1..N stay best-effort.

Two diagnostic improvements for the load generator: 1. http.Transport.MaxIdleConnsPerHost defaults to 2 in stdlib. With --concurrency=8 (or higher), 6+ goroutines per cycle had to open fresh TCP+TLS sockets per request because the pool couldn't hold their idle conns between requests. Bump MaxIdleConns / MaxIdleConnsPerHost / MaxConnsPerHost to 2*concurrency so every active sender has a reusable keep-alive socket, eliminating handshake churn from the hot path. 2. Always-on net/http/pprof on 127.0.0.1:6060. evnode-txsim is a load tester, not a production daemon, so cost of always serving profiling is acceptable; the payoff is being able to grab CPU profiles under live load without re-deploying the binary — `ssh -L 6060:127.0.0.1:6060 root@loadgen \ go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`. A profile captured this way under c=8 traced the per-request hot path: 25.5% in kernel write(2), 25% in net/http body marshaling. That diagnostic surfaced that the c6in.2xlarge loadgen was the binding constraint for the experiment at ~22 MB/s, not evnode or DA — a finding we'd have spent another debug round chasing without the in-process profiler.

coderabbitai · 2026-05-02T23:45:24Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c97e7c08-2a60-4f33-b0a2-bbf95f1f8fcd

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: Turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get your free trial and get 200 agent minutes per Slack user (a $50 value).

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

walldiss added 3 commits May 3, 2026 01:41

github-actions Bot assigned walldiss May 2, 2026

julienrbrt merged commit 513ce9b into evstack:julien/fiber May 3, 2026
17 of 25 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(tools/talis): fibre-experiment race fixes & loadgen tooling#3305

fix(tools/talis): fibre-experiment race fixes & loadgen tooling#3305
julienrbrt merged 3 commits intoevstack:julien/fiberfrom
walldiss:pr1-talis-fibre-fixes

walldiss commented May 2, 2026

Uh oh!

coderabbitai Bot commented May 2, 2026

Review skipped

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

walldiss commented May 2, 2026

Issues

Solution

Test plan

Uh oh!

coderabbitai Bot commented May 2, 2026

Review skipped

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants